Translated by Google Translate
Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.
With recent advances in video and 3D understanding, novel 4D spatio-temporal challenges that fuse both concepts have emerged. In this direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. Current methods tackle VQ3D by lifting the 2D localization results of the sister task, Visual Queries with 2D Localization (VQ2D), into a 3D reconstruction. Yet, we point out that the low number of Queries with Poses (QwP) from previous VQ3D methods severely hinders their overall success rate and highlights the need for further effort in 3D modeling to tackle the VQ3D task. In this work, we formalize a pipeline that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. We estimate more robust camera poses, leading to more successful object queries and substantially improved VQ3D performance. In practice, our method reaches a top-1 overall success rate of 86.36% on the Ego4D Episodic Memory Benchmark VQ3D, a 10x improvement over the previous state-of-the-art. In addition, we provide a complete empirical study highlighting the remaining challenges in VQ3D.
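The lifting step described above (a 2D detection turned into a 3D position expressed relative to the query frame's camera pose) can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes a pinhole camera, a known depth at the detected pixel, and camera-to-world poses given as a rotation matrix and translation. All function names and parameters here are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth into camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def mat_vec(R, p):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) for i in range(3))


def transpose(R):
    """Transpose a 3x3 matrix; for a rotation matrix this is its inverse."""
    return [[R[j][i] for j in range(3)] for i in range(3)]


def cam_to_world(R, t, p_cam):
    """Apply a camera-to-world pose: p_world = R * p_cam + t."""
    rotated = mat_vec(R, p_cam)
    return tuple(rotated[i] + t[i] for i in range(3))


def world_to_cam(R, t, p_world):
    """Invert a camera-to-world pose: p_cam = R^T * (p_world - t)."""
    shifted = tuple(p_world[i] - t[i] for i in range(3))
    return mat_vec(transpose(R), shifted)


def localize_in_query_frame(pixel, depth, intrinsics, pose_detection, pose_query):
    """Lift a 2D detection centre to 3D and express it in the query camera's frame.

    pixel: (u, v) centre of the 2D detection (assumed given by a 2D localizer).
    depth: depth at that pixel (assumed known, e.g. from a depth estimate).
    intrinsics: (fx, fy, cx, cy) of the assumed pinhole camera.
    pose_detection, pose_query: (R, t) camera-to-world poses of the two frames.
    """
    p_cam = backproject(pixel[0], pixel[1], depth, *intrinsics)
    p_world = cam_to_world(pose_detection[0], pose_detection[1], p_cam)
    return world_to_cam(pose_query[0], pose_query[1], p_world)
```

The sketch also shows why missing camera poses are costly: without `pose_detection` and `pose_query` for the relevant frames, the 2D detection cannot be placed in the query frame at all.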
The understanding capabilities of current state-of-the-art 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories. In the 2D domain, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for the 3D modality is a promising way to improve 3D understanding under a restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of images, text, and 3D point clouds by pre-training with object triplets from the three modalities. To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs. Then, ULIP learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to 3D backbone networks and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% on top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models will be released.
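Once a 3D embedding lives in the same space as text embeddings, zero-shot classification can be done by picking the category whose text embedding is most similar. The abstract does not spell out this step, so the following is a minimal sketch that assumes the common CLIP-style recipe of nearest text embedding by cosine similarity. The function names and the toy vectors are hypothetical; in practice the embeddings would come from trained encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(shape_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the 3D shape embedding.

    shape_embedding: vector produced by a 3D encoder (assumed already aligned
        with the text space).
    text_embeddings: dict mapping each category name to its text embedding.
    """
    return max(
        text_embeddings,
        key=lambda label: cosine_similarity(shape_embedding, text_embeddings[label]),
    )


# Toy 2D embeddings, purely illustrative.
labels = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
predicted = zero_shot_classify([0.9, 0.1], labels)
```

No labeled 3D examples of the target categories are needed at test time; only the category names have to be embedded.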
A tractogram is a virtual representation of the brain white matter. It is composed of millions of virtual fibers, encoded as 3D polylines, which approximate the white matter axonal pathways. To date, tractograms are the most accurate white matter representation and thus are used for tasks like presurgical planning and investigations of neuroplasticity, brain disorders, or brain networks. However, it is a well-known issue that a large portion of tractogram fibers are not anatomically plausible and can be considered artifacts of the tracking procedure. With Verifyber, we tackle the problem of filtering out such non-plausible fibers using a novel fully-supervised learning approach. Unlike other approaches based on signal reconstruction and/or brain topology regularization, we guide our method with the existing anatomical knowledge of the white matter. Using tractograms annotated according to anatomical principles, we train our model, Verifyber, to classify fibers as either anatomically plausible or non-plausible. The proposed Verifyber model is an original Geometric Deep Learning method that can deal with variable-size fibers while being invariant to fiber orientation. Our model considers each fiber as a graph of points, and by learning features of the edges between consecutive points via the proposed sequence Edge Convolution, it can capture the underlying anatomical properties. The resulting filtering is highly accurate and robust across an extensive set of experiments, and it is fast: with a 12GB GPU, filtering a tractogram of 1M fibers requires less than a minute. The Verifyber implementation and trained models are available at https://github.com/FBK-NILab/verifyber.
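To make the "fiber as a graph of points" idea concrete, the sketch below builds the edges between consecutive points of a polyline and pools a simple per-edge feature into a fixed-size descriptor. It is not the sequence Edge Convolution of Verifyber, whose details the abstract does not give. It swaps in hand-crafted edge lengths and symmetric pooling (sum, max, mean) to show how a variable-size fiber can yield a fixed-size output that does not depend on the direction in which the fiber is traversed. All names are hypothetical.

```python
import math


def consecutive_edges(points):
    """Index pairs (i, i + 1) linking each point of the polyline to the next one."""
    return [(i, i + 1) for i in range(len(points) - 1)]


def edge_lengths(points):
    """Euclidean length of every edge between consecutive points."""
    return [math.dist(points[i], points[j]) for i, j in consecutive_edges(points)]


def fiber_descriptor(points):
    """Fixed-size descriptor (total length, longest edge, mean edge length).

    Sum, max and mean are symmetric in their inputs, so reversing the point
    order leaves the descriptor unchanged (up to floating-point rounding),
    and the output size does not depend on the number of points.
    """
    lengths = edge_lengths(points)
    total = sum(lengths)
    return (total, max(lengths), total / len(lengths))


# A toy fiber with three points; real fibers have many more.
fiber = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
descriptor = fiber_descriptor(fiber)
reversed_descriptor = fiber_descriptor(list(reversed(fiber)))
```

A learned model would replace the hand-crafted edge length with features computed by a network, but the same two properties (variable input size, orientation invariance) are what the pooling step is meant to illustrate.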
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
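Biological and artificial agents need to deal with constant changes in the real world. We study this problem in four classical continuous control environments, augmented with morphological perturbations. Learning to locomote when the length and thickness of different body parts vary is challenging, as the control policy is required to adapt to the morphology to successfully balance and advance the agent. We show that a control policy based on the proprioceptive state performs poorly with highly variable body configurations, while an (oracle) agent with access to a learned encoding of the perturbation performs significantly better. We introduce DMAP, a biologically inspired, attention-based policy network architecture. DMAP combines independent proprioceptive processing, a distributed policy with an individual controller for each joint, and an attention mechanism that dynamically gates sensory information from different body parts to different controllers. Despite not having access to the (hidden) morphology information, DMAP can be trained end-to-end in all the considered environments, overall matching or surpassing the performance of an oracle agent. Thus DMAP, which implements principles from biological motor control, provides a strong inductive bias for learning challenging sensorimotor tasks. Overall, our work corroborates the power of these principles in challenging locomotion tasks.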
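We introduce LAVIS, an open-source deep learning library for language-vision research and applications. LAVIS aims to serve as a one-stop comprehensive library that makes recent advances in the language-vision field accessible to researchers and practitioners and that fosters future research and development. It features a unified interface for easy access to state-of-the-art image-language and video-language models and to common datasets. LAVIS supports training, evaluation, and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue, and pre-training. At the same time, the library is highly extensible and configurable, facilitating future development and customization. In this technical report, we describe the design principles, key components, and functionalities of the library, and present benchmarking results across common language-vision tasks. The library is available at: https://github.com/salesforce/lavis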
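We study exact active learning of binary and multiclass classifiers with margin. Given an $n$-point set $X \subset \mathbb{R}^m$, we want to learn any unknown classifier on $X$ whose classes have finite strong convex hull margin, a new notion extending the SVM margin. In the standard active learning setting, where only label queries are allowed, learning a classifier with strong convex hull margin $\gamma$ requires in the worst case $\Omega\big(1+\frac{1}{\gamma}\big)^{(m-1)/2}$ queries. On the other hand, using the more powerful seed queries (a variant of equivalence queries), the target classifier can be learned in $O(m \log n)$ queries via Littlestone's Halving algorithm; however, Halving is computationally inefficient. In this work, we show that, by carefully combining the two types of queries, a binary classifier can be learned in time $\operatorname{poly}(n+m)$ using only $O(m^2 \log n)$ label queries and $O\big(m \log \frac{m}{\gamma}\big)$ seed queries; the result extends to $k$-class classifiers at the price of a $k!k^2$ multiplicative overhead. Similar results hold when the input points have bounded bit complexity, or when only one class has strong convex hull margin. We complement these results by showing that, in the worst case, any algorithm needs $\Omega\big(k m \log \frac{1}{\gamma}\big)$ seed and label queries to learn a $k$-class classifier with strong convex hull margin $\gamma$.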
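For readers unfamiliar with the Halving algorithm mentioned above, here is a minimal sketch of the classic version for binary labels over an explicitly enumerated, finite hypothesis class. It uses a generic equivalence-style oracle that returns a counterexample to the current prediction; this stands in for the paper's seed queries, whose exact definition the abstract does not give. All names are hypothetical. Each counterexample removes at least half of the remaining hypotheses, so the number of counterexamples is at most $\log_2 |H|$.

```python
def halving(hypotheses, points, counterexample_oracle):
    """Learn a labeling of `points` with majority-vote predictions.

    hypotheses: list of functions mapping a point to a label in {0, 1};
        the target is assumed to be one of them.
    counterexample_oracle: takes a dict {point: predicted label} and returns
        (point, true_label) for some misclassified point, or None if the
        prediction is entirely correct.
    Returns the final prediction and the number of oracle queries made.
    """
    version_space = list(hypotheses)
    queries = 0
    while True:
        prediction = {}
        for x in points:
            votes_for_one = sum(h(x) for h in version_space)
            # Majority vote over the surviving hypotheses (ties go to label 1).
            prediction[x] = 1 if 2 * votes_for_one >= len(version_space) else 0
        queries += 1
        counterexample = counterexample_oracle(prediction)
        if counterexample is None:
            return prediction, queries
        x, true_label = counterexample
        # The majority was wrong at x, so at least half of the hypotheses go.
        version_space = [h for h in version_space if h(x) == true_label]


def make_threshold(t):
    """Hypothesis that labels x as 1 exactly when x >= t."""
    return lambda x: 1 if x >= t else 0


points = list(range(8))
hypotheses = [make_threshold(t) for t in range(9)]
target = make_threshold(5)


def oracle(prediction):
    """Return the first misclassified point with its true label, if any."""
    for x in points:
        if prediction[x] != target(x):
            return x, target(x)
    return None


learned, num_queries = halving(hypotheses, points, oracle)
```

The cost that makes this approach impractical in general is visible in the sketch: every round takes a vote over the whole surviving hypothesis set, which can be very large when it has to be enumerated explicitly.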